Create alert for OOMKill events inside containers #822
Helps with #759; a second attempt at #760. May also be related to #112 and may supersede #800?

In Kubernetes 1.24, the kubelet began exposing a metric that counts OOMKill events for specific containers, `container_oom_events_total`, which this alert is based on.
This alert fires if a container records any OOMKill events. Without it, multi-process containers (e.g. webservers with multiple "worker" processes) can be OOMKilled silently. I have personally seen a pod running Gunicorn throw a 100% error rate due to OOMKills that 1) didn't show up in app-level monitoring, since the workers died before recording stats, and 2) didn't show up in any existing kubernetes-mixin alerts, since PID 1 never died.

IMO this alert might be better than #800 since it's more granular (at the container and process level). The `OOMKilled` pod status can be misleading, since it just checks whether `exit_code == 137`, which results from any SIGKILL, not only the OOM killer.

Open to suggestions!
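For reference, an alert on `container_oom_events_total` could be sketched roughly like the rule below. This is illustrative only: the alert name, `for` duration, lookback window, and labels are assumptions, not necessarily what this PR implements.

```yaml
# Illustrative Prometheus alerting rule (assumed shape; the rule in this PR
# may differ in name, window, and labels).
groups:
  - name: kubernetes-resources
    rules:
      - alert: KubeContainerOOMKilledEvents
        # Fires when a container has recorded at least one OOMKill event in
        # the last 10 minutes, per the kubelet's container_oom_events_total
        # counter (available from Kubernetes 1.24).
        expr: increase(container_oom_events_total[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: Container has experienced OOMKill events.
          description: >-
            Container {{ $labels.container }} in pod
            {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled at least
            once in the last 10 minutes.
```

Using `increase()` over a window (rather than comparing raw counter values) keeps the alert robust across counter resets when the kubelet restarts.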